Problem Statement¶

Preparation Information¶

  • Developed and Analyzed by: Jerry Gonzalez
  • Cohort: November 2023 - Group D

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective¶

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description¶

  • The data provided is a transformed version of original data which was collected using sensors.
  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.
  • Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries¶

In [285]:
# Installing the libraries with the specified version.
#!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 imbalanced-learn==0.10.1 xgboost==2.0.3 threadpoolctl==3.3.0 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

Load Important Libraries¶

In [286]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)
from sklearn import metrics

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

Loading Important Functions¶

In [287]:
# Purpose: Calculate various discrete statistical values for a specific column in a DataFrame
#
# Prerequisites:
#    Requires the developer to only send data that discrete statistics can safely be calculated for.
#    This function would require more extensive data validation checks and more robust exception handling.
#
# Inputs
#    data   : DataFrame object containing rows and columns of data
#    feature: str representing the column name to run statistics on
#
def calculate_statistics (data, feature):
        
    # Only calculate and print statistics if the feature is a single column string name and data is a DataFrame
    if isinstance(data,pd.DataFrame) and type(feature) == str:
        
        # For future, would like to use Describe to pull data types for each column
        # Then only perform the calculations and prints if of type Int64 or Float64
        
        # Calculate and print various discrete statistical values         
        print(f"Discrete Statistics for {feature}\n")
        print(f"Mean              : {data[feature].mean():.6f}")
        print(f"Mode              : {data[feature].mode()[0]}")
        print(f"Median            : {data[feature].median()}")
        print(f"Min               : {data[feature].min()}")
        print(f"Max               : {data[feature].max()}")   
        print(f"Standard Deviation: {data[feature].std():.6f}")
        print(f"Percentiles       : \n{data[feature].quantile([.25,.50,.75])}")
In [288]:
# Provided by GreatLearning
# function to create histogram and boxplot; both are aligned by mean
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [289]:
# Provided by GreatLearning
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [290]:
# Provided by GreatLearning
# function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )

    plt.tight_layout()
    plt.show()
In [291]:
# Provided by GreatLearning
# Display a stacked barplot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Load Model Related Functions¶

In [292]:
#Outlier detection
def outlier_detection(data):
    """
    Display a grid of box plots for each numeric feature; while showing the outlier data

    data: dataframe
    
    """
    
    # outlier detection using boxplot
    numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
    # dropping booking_status
    numeric_columns.remove("case_status")

    plt.figure(figsize=(15, 12))

    for i, variable in enumerate(numeric_columns):
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)

    plt.show()
In [293]:
# Purpose: To treat outliers by clipping them to the lower and upper whisker
#
# Inputs:
#     df: Dataframe
#     col: Feature that has outliers to treat
#
# Note: This procedure is being utilized from GreatLearning; Week 4 (Hands_on_Notebook_ExploratoryDataAnalysis)
def treat_outliers(df, col):
    """
    treats outliers in a variable
    col: str, name of the numerical variable
    df: dataframe
    col: name of the column
    """
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1                # Inter Quantile Range (75th perentile - 25th percentile)
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR

    # all the values smaller than lower_whisker will be assigned the value of lower_whisker
    # all the values greater than upper_whisker will be assigned the value of upper_whisker
    # the assignment will be done by using the clip function of NumPy
    df[col] = np.clip(df[col], lower_whisker, upper_whisker)

    return df
In [294]:
# Provided by GreatLearning
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [295]:
# Provided by GreatLearning
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [296]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Loading the dataset¶

In [297]:
df = pd.read_csv("./train.csv")
df_test = pd.read_csv("./test.csv")

Quick check to ensure data is read in properly¶

In [298]:
# Verify the data file was read correctly by displaying the first five rows.
df.head(5)
Out[298]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.465 -4.679 3.102 0.506 -0.221 -2.033 -2.911 0.051 -1.522 3.762 -5.715 0.736 0.981 1.418 -3.376 -3.047 0.306 2.914 2.270 4.395 -2.388 0.646 -1.191 3.133 0.665 -2.511 -0.037 0.726 -3.982 -1.073 1.667 3.060 -1.690 2.846 2.235 6.667 0.444 -2.369 2.951 -3.480 0
1 3.366 3.653 0.910 -1.368 0.332 2.359 0.733 -4.332 0.566 -0.101 1.914 -0.951 -1.255 -2.707 0.193 -4.769 -2.205 0.908 0.757 -5.834 -3.065 1.597 -1.757 1.766 -0.267 3.625 1.500 -0.586 0.783 -0.201 0.025 -1.795 3.033 -2.468 1.895 -2.298 -1.731 5.909 -0.386 0.616 0
2 -3.832 -5.824 0.634 -2.419 -1.774 1.017 -2.099 -3.173 -2.082 5.393 -0.771 1.107 1.144 0.943 -3.164 -4.248 -4.039 3.689 3.311 1.059 -2.143 1.650 -1.661 1.680 -0.451 -4.551 3.739 1.134 -2.034 0.841 -1.600 -0.257 0.804 4.086 2.292 5.361 0.352 2.940 3.839 -4.309 0
3 1.618 1.888 7.046 -1.147 0.083 -1.530 0.207 -2.494 0.345 2.119 -3.053 0.460 2.705 -0.636 -0.454 -3.174 -3.404 -1.282 1.582 -1.952 -3.517 -1.206 -5.628 -1.818 2.124 5.295 4.748 -2.309 -3.963 -6.029 4.949 -3.584 -2.577 1.364 0.623 5.550 -1.527 0.139 3.101 -1.277 0
4 -0.111 3.872 -3.758 -2.983 3.793 0.545 0.205 4.849 -1.855 -6.220 1.998 4.724 0.709 -1.989 -2.633 4.184 2.245 3.734 -6.313 -5.380 -0.887 2.062 9.446 4.490 -3.945 4.582 -8.780 -3.383 5.107 6.788 2.044 8.266 6.629 -10.069 1.223 -3.230 1.687 -2.164 -3.645 6.510 0
In [299]:
# Verify the entire data file was read correctly by displaying the last five rows.
df.tail(5)
Out[299]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
19995 -2.071 -1.088 -0.796 -3.012 -2.288 2.807 0.481 0.105 -0.587 -2.899 8.868 1.717 1.358 -1.777 0.710 4.945 -3.100 -1.199 -1.085 -0.365 3.131 -3.948 -3.578 -8.139 -1.937 -1.328 -0.403 -1.735 9.996 6.955 -3.938 -8.274 5.745 0.589 -0.650 -3.043 2.216 0.609 0.178 2.928 1
19996 2.890 2.483 5.644 0.937 -1.381 0.412 -1.593 -5.762 2.150 0.272 -2.095 -1.526 0.072 -3.540 -2.762 -10.632 -0.495 1.720 3.872 -1.210 -8.222 2.121 -5.492 1.452 1.450 3.685 1.077 -0.384 -0.839 -0.748 -1.089 -4.159 1.181 -0.742 5.369 -0.693 -1.669 3.660 0.820 -1.987 0
19997 -3.897 -3.942 -0.351 -2.417 1.108 -1.528 -3.520 2.055 -0.234 -0.358 -3.782 2.180 6.112 1.985 -8.330 -1.639 -0.915 5.672 -3.924 2.133 -4.502 2.777 5.728 1.620 -1.700 -0.042 -2.923 -2.760 -2.254 2.552 0.982 7.112 1.476 -3.954 1.856 5.029 2.083 -6.409 1.477 -0.874 0
19998 -3.187 -10.052 5.696 -4.370 -5.355 -1.873 -3.947 0.679 -2.389 5.457 1.583 3.571 9.227 2.554 -7.039 -0.994 -9.665 1.155 3.877 3.524 -7.015 -0.132 -3.446 -4.801 -0.876 -3.812 5.422 -3.732 0.609 5.256 1.915 0.403 3.164 3.752 8.530 8.451 0.204 -7.130 4.249 -6.112 0
19999 -2.687 1.961 6.137 2.600 2.657 -4.291 -2.344 0.974 -1.027 0.497 -9.589 3.177 1.055 -1.416 -4.669 -5.405 3.720 2.893 2.329 1.458 -6.429 1.818 0.806 7.786 0.331 5.257 -4.867 -0.819 -5.667 -2.861 4.674 6.621 -1.989 -1.349 3.952 5.450 -0.455 -2.202 1.678 -1.974 0
In [300]:
# Verify the data file was read correctly by displaying the first five rows.
df_test.head(5)
Out[300]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -0.613 -3.820 2.202 1.300 -1.185 -4.496 -1.836 4.723 1.206 -0.342 -5.123 1.017 4.819 3.269 -2.984 1.387 2.032 -0.512 -1.023 7.339 -2.242 0.155 2.054 -2.772 1.851 -1.789 -0.277 -1.255 -3.833 -1.505 1.587 2.291 -5.411 0.870 0.574 4.157 1.428 -10.511 0.455 -1.448 0
1 0.390 -0.512 0.527 -2.577 -1.017 2.235 -0.441 -4.406 -0.333 1.967 1.797 0.410 0.638 -1.390 -1.883 -5.018 -3.827 2.418 1.762 -3.242 -3.193 1.857 -1.708 0.633 -0.588 0.084 3.014 -0.182 0.224 0.865 -1.782 -2.475 2.494 0.315 2.059 0.684 -0.485 5.128 1.721 -1.488 0
2 -0.875 -0.641 4.084 -1.590 0.526 -1.958 -0.695 1.347 -1.732 0.466 -4.928 3.565 -0.449 -0.656 -0.167 -1.630 2.292 2.396 0.601 1.794 -2.120 0.482 -0.841 1.790 1.874 0.364 -0.169 -0.484 -2.119 -2.157 2.907 -1.319 -2.997 0.460 0.620 5.632 1.324 -1.752 1.808 1.676 0
3 0.238 1.459 4.015 2.534 1.197 -3.117 -0.924 0.269 1.322 0.702 -5.578 -0.851 2.591 0.767 -2.391 -2.342 0.572 -0.934 0.509 1.211 -3.260 0.105 -0.659 1.498 1.100 4.143 -0.248 -1.137 -5.356 -4.546 3.809 3.518 -3.074 -0.284 0.955 3.029 -1.367 -3.412 0.906 -2.451 0
4 5.828 2.768 -1.235 2.809 -1.642 -1.407 0.569 0.965 1.918 -2.775 -0.530 1.375 -0.651 -1.679 -0.379 -4.443 3.894 -0.608 2.945 0.367 -5.789 4.598 4.450 3.225 0.397 0.248 -2.362 1.079 -0.473 2.243 -3.591 1.774 -1.502 -2.227 4.777 -6.560 -0.806 -0.276 -3.858 -0.538 0
In [301]:
# Verify the entire data file was read correctly by displaying the last five rows.
df_test.tail(5)
Out[301]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
4995 -5.120 1.635 1.251 4.036 3.291 -2.932 -1.329 1.754 -2.985 1.249 -6.878 3.715 -2.512 -1.395 -2.554 -2.197 4.772 2.403 3.792 0.487 -2.028 1.778 3.668 11.375 -1.977 2.252 -7.319 1.907 -3.734 -0.012 2.120 9.979 0.063 0.217 3.036 2.109 -0.557 1.939 0.513 -2.694 0
4996 -5.172 1.172 1.579 1.220 2.530 -0.669 -2.618 -2.001 0.634 -0.579 -3.671 0.460 3.321 -1.075 -7.113 -4.356 -0.001 3.698 -0.846 -0.222 -3.645 0.736 0.926 3.278 -2.277 4.458 -4.543 -1.348 -1.779 0.352 -0.214 4.424 2.604 -2.152 0.917 2.157 0.467 0.470 2.197 -2.377 0
4997 -1.114 -0.404 -1.765 -5.879 3.572 3.711 -2.483 -0.308 -0.922 -2.999 -0.112 -1.977 -1.623 -0.945 -2.735 -0.813 0.610 8.149 -9.199 -3.872 -0.296 1.468 2.884 2.792 -1.136 1.198 -4.342 -2.869 4.124 4.197 3.471 3.792 7.482 -10.061 -0.387 1.849 1.818 -1.246 -1.261 7.475 0
4998 -1.703 0.615 6.221 -0.104 0.956 -3.279 -1.634 -0.104 1.388 -1.066 -7.970 2.262 3.134 -0.486 -3.498 -4.562 3.136 2.536 -0.792 4.398 -4.073 -0.038 -2.371 -1.542 2.908 3.215 -0.169 -1.541 -4.724 -5.525 1.668 -4.100 -5.949 0.550 -1.574 6.824 2.139 -4.036 3.436 0.579 0
4999 -0.604 0.960 -0.721 8.230 -1.816 -2.276 -2.575 -1.041 4.130 -2.731 -3.292 -1.674 0.465 -1.646 -5.263 -7.988 6.480 0.226 4.963 6.752 -6.306 3.271 1.897 3.271 -0.637 -0.925 -6.759 2.990 -0.814 3.499 -8.435 2.370 -1.062 0.791 4.952 -7.441 -0.070 -0.918 -2.291 -5.363 0

Data Overview¶

  • Observations
  • Sanity checks
In [302]:
# Let's make a copy of our data sets
data = df.copy()
data_test = df_test.copy()
In [303]:
#Check the size of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} features in the data frame.")
There are 20000 rows and 41 features in the data frame.
In [304]:
#Check the size of the data
print(f"There are {data_test.shape[0]} rows and {data.shape[1]} features in the test data frame.")
There are 5000 rows and 41 features in the test data frame.

Check the data types¶

In [305]:
# let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB

Observations¶

  • All independent variables (V1-V40) are of type float64
  • The dependent variable (Target) is of type int64
  • V1 and V2 are each missing 18 values
In [306]:
#Show the statistical summary of the data
data.describe(include='all').T
Out[306]:
count mean std min 25% 50% 75% max
V1 19982.000 -0.272 3.442 -11.876 -2.737 -0.748 1.840 15.493
V2 19982.000 0.440 3.151 -12.320 -1.641 0.472 2.544 13.089
V3 20000.000 2.485 3.389 -10.708 0.207 2.256 4.566 17.091
V4 20000.000 -0.083 3.432 -15.082 -2.348 -0.135 2.131 13.236
V5 20000.000 -0.054 2.105 -8.603 -1.536 -0.102 1.340 8.134
V6 20000.000 -0.995 2.041 -10.227 -2.347 -1.001 0.380 6.976
V7 20000.000 -0.879 1.762 -7.950 -2.031 -0.917 0.224 8.006
V8 20000.000 -0.548 3.296 -15.658 -2.643 -0.389 1.723 11.679
V9 20000.000 -0.017 2.161 -8.596 -1.495 -0.068 1.409 8.138
V10 20000.000 -0.013 2.193 -9.854 -1.411 0.101 1.477 8.108
V11 20000.000 -1.895 3.124 -14.832 -3.922 -1.921 0.119 11.826
V12 20000.000 1.605 2.930 -12.948 -0.397 1.508 3.571 15.081
V13 20000.000 1.580 2.875 -13.228 -0.224 1.637 3.460 15.420
V14 20000.000 -0.951 1.790 -7.739 -2.171 -0.957 0.271 5.671
V15 20000.000 -2.415 3.355 -16.417 -4.415 -2.383 -0.359 12.246
V16 20000.000 -2.925 4.222 -20.374 -5.634 -2.683 -0.095 13.583
V17 20000.000 -0.134 3.345 -14.091 -2.216 -0.015 2.069 16.756
V18 20000.000 1.189 2.592 -11.644 -0.404 0.883 2.572 13.180
V19 20000.000 1.182 3.397 -13.492 -1.050 1.279 3.493 13.238
V20 20000.000 0.024 3.669 -13.923 -2.433 0.033 2.512 16.052
V21 20000.000 -3.611 3.568 -17.956 -5.930 -3.533 -1.266 13.840
V22 20000.000 0.952 1.652 -10.122 -0.118 0.975 2.026 7.410
V23 20000.000 -0.366 4.032 -14.866 -3.099 -0.262 2.452 14.459
V24 20000.000 1.134 3.912 -16.387 -1.468 0.969 3.546 17.163
V25 20000.000 -0.002 2.017 -8.228 -1.365 0.025 1.397 8.223
V26 20000.000 1.874 3.435 -11.834 -0.338 1.951 4.130 16.836
V27 20000.000 -0.612 4.369 -14.905 -3.652 -0.885 2.189 17.560
V28 20000.000 -0.883 1.918 -9.269 -2.171 -0.891 0.376 6.528
V29 20000.000 -0.986 2.684 -12.579 -2.787 -1.176 0.630 10.722
V30 20000.000 -0.016 3.005 -14.796 -1.867 0.184 2.036 12.506
V31 20000.000 0.487 3.461 -13.723 -1.818 0.490 2.731 17.255
V32 20000.000 0.304 5.500 -19.877 -3.420 0.052 3.762 23.633
V33 20000.000 0.050 3.575 -16.898 -2.243 -0.066 2.255 16.692
V34 20000.000 -0.463 3.184 -17.985 -2.137 -0.255 1.437 14.358
V35 20000.000 2.230 2.937 -15.350 0.336 2.099 4.064 15.291
V36 20000.000 1.515 3.801 -14.833 -0.944 1.567 3.984 19.330
V37 20000.000 0.011 1.788 -5.478 -1.256 -0.128 1.176 7.467
V38 20000.000 -0.344 3.948 -17.375 -2.988 -0.317 2.279 15.290
V39 20000.000 0.891 1.753 -6.439 -0.272 0.919 2.058 7.760
V40 20000.000 -0.876 3.012 -11.024 -2.940 -0.921 1.120 10.654
Target 20000.000 0.056 0.229 0.000 0.000 0.000 0.000 1.000

Observations¶

  • At this point, the describe command does not offer anything too variable.
  • Also difficult since the feature names have been encoded.
  • Let's explore for any duplicate or missing values.

Checking for duplicate values¶

In [307]:
data.nunique()
Out[307]:
V1        19982
V2        19982
V3        20000
V4        20000
V5        20000
V6        20000
V7        20000
V8        20000
V9        20000
V10       20000
V11       20000
V12       20000
V13       20000
V14       20000
V15       20000
V16       20000
V17       20000
V18       20000
V19       20000
V20       20000
V21       20000
V22       20000
V23       20000
V24       20000
V25       20000
V26       20000
V27       20000
V28       20000
V29       20000
V30       20000
V31       20000
V32       20000
V33       20000
V34       20000
V35       20000
V36       20000
V37       20000
V38       20000
V39       20000
V40       20000
Target        2
dtype: int64

Observations¶

  • All V1-V40 values are unique. Since these features are obtained from sensor data this makes sense.
  • Two unique values for Target, which is also acceptable.

Checking for missing values¶

In [308]:
# Check for missing values.
test_results = data.isnull().sum()
test_results[test_results>0]
Out[308]:
V1    18
V2    18
dtype: int64
In [309]:
#Let's investigate the rows that have a missing V1 value
data[data['V1'].isnull()]
Out[309]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
89 NaN -3.961 2.788 -4.713 -3.007 -1.541 -0.881 1.477 0.575 -1.101 -1.847 4.541 4.490 0.710 -2.138 -2.026 0.136 2.792 -1.167 4.870 -3.924 1.493 -0.173 -6.471 3.008 -3.134 3.956 -1.898 -0.642 -0.538 -1.876 -8.326 -5.141 1.121 -0.306 5.315 3.750 -5.631 2.372 2.196 0
5941 NaN 1.008 1.228 5.397 0.064 -2.707 -2.028 0.534 3.007 -2.362 -5.713 -1.620 -0.046 -0.511 -3.030 -4.996 6.425 0.773 1.235 5.860 -3.851 1.707 1.016 2.310 1.162 0.388 -4.908 1.453 -2.539 -0.518 -2.749 1.870 -3.115 -0.550 1.714 -2.257 0.411 -3.434 -1.299 -1.769 0
6317 NaN -5.205 1.998 -3.708 -1.042 -1.593 -2.653 0.852 -1.310 2.407 -2.696 3.517 6.080 1.893 -6.296 -2.354 -3.713 4.059 -0.373 1.624 -5.273 2.433 2.354 0.062 -0.469 -1.308 1.865 -2.446 -2.908 1.166 1.492 3.074 -0.068 -0.278 3.197 7.016 1.302 -4.580 2.956 -2.363 0
6464 NaN 2.146 5.004 4.192 1.428 -6.438 -0.931 3.794 -0.683 -0.739 -8.189 6.676 4.109 -0.653 -4.763 -1.715 4.042 -0.464 4.026 3.830 -5.310 0.926 2.933 4.457 -0.354 4.864 -5.043 -0.770 -5.669 -2.644 1.855 5.231 -5.113 1.746 2.587 3.991 0.611 -4.273 1.865 -3.599 0
7073 NaN 2.534 2.763 -1.674 -1.942 -0.030 0.911 -3.200 2.949 -0.413 0.013 -0.483 2.908 -0.942 -0.655 -6.153 -2.604 -0.674 0.767 -2.704 -6.404 2.858 -1.414 -2.859 2.362 3.168 5.590 -1.769 -2.734 -3.304 -0.201 -4.887 -2.612 -1.501 2.036 -0.829 -1.370 0.572 -0.132 -0.322 0
8431 NaN -1.399 -2.008 -1.750 0.932 -1.290 -0.270 4.459 -2.776 -1.212 -2.049 5.283 -0.872 0.068 -0.667 1.865 3.443 3.297 -0.930 0.944 -0.558 2.547 6.471 4.467 -0.811 -2.225 -3.844 0.170 0.232 2.963 0.415 4.560 -0.421 -2.037 1.110 1.521 2.114 -2.253 -0.939 2.542 0
8439 NaN -3.841 0.197 4.148 1.151 -0.993 -4.732 0.559 -0.927 0.458 -4.889 -1.247 -1.653 -0.235 -5.407 -2.989 4.834 4.638 1.297 6.399 -1.092 0.134 0.410 6.207 -1.939 -2.996 -8.530 2.124 0.821 4.871 -2.013 6.819 3.451 0.242 3.216 1.203 1.275 -1.921 0.579 -2.838 0
11156 NaN -0.667 3.716 4.934 1.668 -4.356 -2.823 0.373 -0.710 2.177 -8.808 2.562 1.959 0.005 -5.940 -4.676 3.292 1.975 4.434 4.713 -4.124 1.048 0.859 6.753 -0.812 1.876 -4.789 1.248 -6.278 -2.253 0.464 6.663 -2.898 3.068 2.487 4.809 0.069 -1.216 3.014 -5.973 0
11287 NaN -2.562 -0.181 -7.195 -1.044 1.385 1.306 1.559 -2.992 1.275 3.033 3.689 0.522 0.753 2.457 3.192 -4.054 1.523 -2.112 -3.494 0.554 0.755 1.150 -2.128 0.731 -2.165 5.066 -2.036 1.563 0.856 3.188 -2.532 0.560 -1.154 -0.019 4.065 0.979 -0.571 0.630 3.919 0
11456 NaN 1.300 4.383 1.583 -0.077 0.659 -1.639 -4.815 -0.915 2.812 0.572 -0.319 0.853 -2.777 -3.633 -5.402 -4.239 0.261 5.218 -3.446 -4.544 -0.524 -5.112 3.633 -2.315 4.270 -0.810 -0.532 0.693 1.787 0.724 1.772 5.755 1.204 5.664 0.414 -2.644 5.530 2.105 -4.945 0
12221 NaN -2.326 -0.052 0.615 -0.896 -2.437 0.350 2.093 -2.934 2.291 -3.838 6.294 -1.584 0.012 0.547 -0.998 3.333 1.319 5.203 3.560 -0.647 2.200 2.725 4.346 0.560 -4.238 -0.249 2.953 -3.262 -0.752 -2.262 0.135 -5.183 5.252 0.716 3.211 1.642 1.544 1.805 -2.040 0
12447 NaN 0.753 -0.271 1.301 2.039 -1.485 -0.412 0.981 0.810 -0.065 -3.844 -1.009 1.098 1.431 -1.497 0.018 1.403 0.469 -2.055 0.628 0.045 0.566 2.473 1.881 0.200 1.757 -1.190 -0.288 -3.974 -3.101 2.092 4.410 -2.209 -1.359 -1.726 1.679 -0.209 -2.336 0.112 -0.543 0
13086 NaN 2.056 3.331 2.741 2.783 -0.444 -2.015 -0.887 -1.111 0.025 -2.753 -1.148 -1.543 -2.020 -2.344 -1.388 1.272 1.224 0.750 -0.925 -0.823 -1.865 -2.626 5.158 -1.809 4.433 -5.879 -0.431 0.966 1.189 3.295 5.112 4.675 -1.710 2.430 0.997 -1.191 1.207 0.511 -0.884 0
13411 NaN 2.705 4.587 1.868 2.050 -0.925 -1.669 -1.654 -0.243 -0.317 -2.224 0.258 1.562 -2.228 -3.846 -2.398 -0.656 0.637 1.076 -1.443 -2.758 -1.739 -3.150 2.459 -1.692 6.165 -3.977 -1.734 0.289 0.199 2.580 2.527 3.625 -1.200 2.328 1.667 -0.943 0.947 1.655 -1.665 0
14202 NaN 7.039 2.145 -3.202 4.113 3.376 -1.337 -4.546 1.941 -5.467 2.364 -1.338 3.052 -4.598 -6.043 -4.133 -2.799 4.435 -6.633 -8.543 -4.267 -0.383 -1.141 -0.153 -3.116 11.244 -5.046 -5.440 5.035 2.808 1.920 0.158 9.768 -10.258 0.514 -1.975 -0.029 3.127 0.009 4.538 0
15520 NaN 1.383 3.237 -3.818 -1.917 0.438 1.348 -2.036 1.156 0.307 2.234 0.628 3.356 -0.483 0.548 -2.162 -5.072 -1.413 -0.092 -3.925 -4.032 0.784 -2.563 -4.674 1.767 2.998 6.633 -2.927 -0.687 -2.376 2.066 -5.415 -0.897 -1.058 1.417 1.162 -1.147 -0.048 0.605 0.815 0
16576 NaN 3.934 -0.762 2.652 1.754 -0.554 1.829 -0.105 -3.737 1.037 -0.359 5.859 -4.206 -3.349 1.476 -0.451 2.342 -0.376 6.431 -3.529 0.458 0.970 2.185 8.724 -2.764 1.919 -4.303 2.849 -0.029 1.116 -1.477 3.486 1.028 2.846 1.744 -2.000 -0.783 8.698 0.352 -2.005 0
18104 NaN 1.492 2.659 0.223 -0.304 -1.347 0.044 -0.159 1.108 -0.573 -2.281 0.316 1.005 -0.495 -0.360 -2.629 0.661 -0.311 0.490 0.092 -3.322 1.033 -0.598 -0.154 1.547 2.155 0.984 -0.863 -2.067 -2.184 1.339 -1.007 -2.230 -0.871 1.300 0.668 -0.503 -1.485 -0.154 0.157 0
In [310]:
#Let's investigate the rows that have a missing V2 value
data[data['V2'].isnull()]
Out[310]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
613 -2.049 NaN -1.624 -3.324 0.152 0.600 -1.813 0.852 -1.523 0.211 -0.460 2.380 1.676 0.529 -3.768 -1.096 -0.785 4.855 -1.961 0.047 -2.195 2.567 3.988 2.068 -1.312 -2.227 -1.315 -0.934 0.535 3.590 -0.471 3.264 2.379 -2.457 1.719 2.537 1.702 -1.435 0.597 0.739 0
2236 -3.761 NaN 0.195 -1.638 1.261 -1.574 -3.686 1.576 -0.310 -0.138 -4.495 1.817 5.029 1.437 -8.109 -2.803 -0.187 5.801 -3.025 2.019 -5.083 3.033 5.197 3.117 -1.580 0.259 -3.535 -2.270 -2.474 2.470 1.162 7.621 1.695 -3.956 2.708 4.657 1.619 -5.537 1.247 -1.163 0
2508 -1.431 NaN 0.660 -2.876 1.150 -0.786 -1.560 2.899 -2.347 -0.218 -1.131 2.931 2.053 0.375 -3.123 1.321 -1.053 3.188 -2.288 -1.314 -2.461 1.292 3.694 3.003 -1.523 0.904 -2.650 -2.502 0.678 3.295 3.915 6.279 3.324 -4.048 3.119 3.336 0.604 -3.782 -0.157 1.503 0
4653 5.466 NaN 4.541 -2.917 0.400 2.799 0.029 -7.334 1.123 1.695 1.165 -2.778 0.571 -3.078 -1.388 -8.513 -6.208 1.401 0.769 -9.145 -6.873 2.065 -4.812 1.897 0.338 7.160 4.653 -2.619 -1.107 -2.284 3.652 -1.536 4.596 -4.104 4.296 0.153 -3.727 6.563 0.706 -0.462 0
6810 -2.631 NaN 2.330 1.090 0.604 -1.139 -0.690 -1.359 0.356 -1.189 -1.703 3.141 2.523 -2.171 -3.983 -3.457 0.497 1.160 1.968 0.019 -3.499 0.381 -0.338 0.911 -1.197 3.694 -2.561 -0.729 -0.450 0.165 -1.960 -0.950 0.210 0.449 1.046 0.537 0.763 1.729 1.886 -1.702 0
7788 -4.203 NaN 2.954 0.584 4.104 -0.639 -2.811 -0.112 -1.363 -0.800 -1.392 0.420 3.812 -1.782 -7.549 -1.170 -3.184 2.585 -1.856 -5.779 -4.962 -0.045 1.937 6.762 -4.828 9.171 -7.403 -4.276 0.950 3.959 6.185 12.522 9.502 -7.153 5.669 1.250 -2.159 -0.954 -0.002 -1.547 0
8483 -4.484 NaN 1.201 -2.042 2.779 -0.802 -5.404 -1.225 1.486 -0.974 -5.913 -0.329 7.565 0.805 -12.687 -7.009 -1.561 8.508 -5.537 0.200 -8.388 4.009 5.066 3.765 -2.405 4.073 -4.742 -4.100 -3.459 2.146 1.662 9.467 4.281 -7.588 3.267 5.232 1.279 -5.371 1.984 -1.643 0
8894 3.264 NaN 8.447 -3.253 -3.418 -2.996 -0.669 -0.161 -0.667 3.134 -2.112 3.735 5.746 0.330 -1.831 -3.277 -5.365 -1.125 3.783 0.579 -7.446 0.403 -4.710 -3.815 2.681 1.785 7.026 -3.364 -3.217 -2.715 4.555 -4.243 -3.123 2.522 5.284 7.291 -0.868 -4.315 3.124 -2.393 0
8947 -3.793 NaN 0.720 2.306 0.935 -0.984 0.505 -0.441 -2.767 1.735 -1.988 4.212 -2.798 -2.083 0.342 -1.369 2.095 0.307 5.488 -0.388 0.089 0.326 0.122 6.040 -1.381 0.375 -2.734 2.510 -1.072 -0.054 -1.293 1.528 -0.497 3.790 1.131 0.618 -0.111 5.709 1.542 -2.481 0
9362 2.662 NaN 2.980 4.431 -0.238 0.672 0.380 -7.647 4.435 -0.746 -1.169 -3.067 0.025 -3.767 -1.931 -10.298 0.341 -1.307 4.457 -2.175 -5.360 1.257 -5.030 0.454 0.703 6.003 0.909 1.180 -2.527 -4.018 -4.607 -5.494 -1.105 1.225 0.976 -4.794 -2.269 7.671 0.825 -3.929 0
9425 -2.354 NaN 2.054 0.812 2.540 -0.925 -0.208 -0.563 -0.140 -2.147 -3.838 2.682 -0.660 -2.519 -1.708 -2.675 3.630 2.293 -0.160 -0.368 -1.414 0.225 0.243 2.928 -0.190 4.111 -4.003 -0.160 -0.929 -1.678 -0.042 -0.621 -0.897 -1.181 -1.237 1.237 1.228 2.074 1.224 1.472 0
9848 -1.764 NaN 2.845 -2.753 -0.812 -0.101 -1.382 -1.105 -0.054 0.160 0.640 2.035 4.863 -0.351 -4.249 -1.557 -3.843 1.644 -0.471 -0.326 -3.334 -0.352 -1.690 -3.143 -0.703 1.791 1.293 -2.779 0.840 1.251 0.264 -2.159 1.860 -0.337 1.509 3.408 0.923 -1.503 2.515 -0.794 0
11637 -2.271 NaN 1.710 1.158 -0.355 -5.449 -0.786 3.936 -1.576 0.801 -8.512 8.426 2.662 0.696 -3.692 -3.227 5.014 2.677 4.117 5.919 -5.061 4.175 5.949 4.687 1.123 -1.937 -1.736 1.307 -7.059 -2.439 -1.546 2.651 -8.429 3.511 1.500 5.552 2.589 -3.453 2.324 -2.760 0
12339 -1.664 NaN -0.712 -4.347 1.392 -0.094 -2.163 -0.381 0.031 -0.659 -5.653 2.888 2.208 0.552 -5.221 -5.363 2.142 8.083 -4.127 1.704 -3.908 4.500 4.886 2.087 0.979 -1.480 -0.362 -0.818 -3.844 -1.256 -1.122 0.307 -2.691 -3.112 -1.596 5.821 3.462 -1.737 2.291 2.241 0
15913 0.768 NaN 5.296 0.043 -1.174 -2.249 0.956 -0.090 -0.242 -1.061 -2.449 5.086 0.434 -2.633 0.849 -2.631 2.178 -0.845 3.864 1.723 -2.994 -0.466 -3.444 -1.775 2.113 2.187 0.926 -0.192 -0.633 -2.589 -0.803 -7.720 -4.519 3.182 0.453 2.175 1.262 0.893 2.027 0.633 0
18342 -0.929 NaN 2.376 -1.237 3.229 -2.100 -2.190 0.589 1.956 -5.008 -7.388 3.314 3.774 -1.836 -7.099 -6.071 4.892 6.479 -4.841 0.968 -6.694 3.470 4.668 2.432 0.399 5.752 -5.572 -2.882 -2.986 -1.455 0.333 1.613 -1.821 -6.665 -0.455 3.055 2.935 -3.791 0.863 3.336 0
18343 -2.377 NaN -0.009 -1.472 1.295 0.725 -1.123 -3.190 3.251 -4.862 -0.685 2.360 5.432 -2.508 -7.250 -5.571 0.679 4.391 -3.424 -0.273 -4.233 1.505 1.570 -3.372 -1.288 4.813 -2.778 -2.350 0.684 0.351 -5.729 -5.093 0.439 -3.167 -2.713 -0.593 3.229 1.316 2.283 1.152 0
18907 -0.119 NaN 3.658 -1.232 1.947 -0.119 0.652 -1.490 -0.034 -2.557 -2.094 2.939 -0.489 -3.372 -0.236 -2.676 1.934 1.647 -0.603 -2.326 -1.779 -0.466 -2.086 0.333 0.671 5.423 -1.576 -1.345 0.404 -2.333 0.960 -4.670 -0.594 -1.651 -1.405 1.531 1.079 2.833 1.451 3.233 0
In [311]:
# Check for missing values in the test data set
test_results = data_test.isnull().sum()
test_results[test_results>0]
Out[311]:
V1    5
V2    6
dtype: int64

Observations¶

  • The missing data appears to occur throughout the dataset. No discernable patterns in the missing data.
  • Both the data (eventually training and validation) has records to impute.
  • The test data has records to impute.

Exploratory Data Analysis (EDA)¶

Plotting histograms and boxplots for all the variables¶

In [312]:
# Loop through all sensor features and create a histogram and boxplot for each feature.
for feature in data.columns:
    histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None) ## Please change the dataframe name as you define while reading the data

Observations¶

  • Most features resemble a normal distrubtion.
  • All variable sensor features contain outliers on both ends.
  • Outlier data likely represents good data and will not be treated.

  • Some features have slight skewness:

    • Right-skewed features: V1, V18, V27, V37
    • Left-skewed feature: V8, V10, V30, V34
In [313]:
# Let's barplot the Target feature from the training/validation data set.
labeled_barplot(data, feature="Target", perc=True)

Observations¶

  • Failures represent 5.5% of the training/validation data.
  • Non-failures represent 94.5% of the training/validation data.
In [314]:
# Let's barplot the Target feature from the test dataset
labeled_barplot(data_test, feature="Target", perc=True)

Observations¶

  • Failures represent 5.6% of the Test data.
  • Non-failures represent 94.4% of the Test data.

Multivariate analysis¶

  • These plots were added after the final model was selected in the influential columns were identified.
  • Let's see how theses influential features impact the Target feature.
In [315]:
# Let's create multiple barplots of important features vs Target
cols = data[['V3','V15','V18','V36','V39']].columns.tolist()
plt.figure(figsize=(10,10))

# Loop through each important feature
for i, variable in enumerate(cols):
                     plt.subplot(3,2,i+1)
                     sns.boxplot(x="Target",y=variable,data=data,palette="PuBu",showfliers=False)
                     plt.tight_layout()
                     plt.title(variable)
plt.show()
In [316]:
# Display the numeric fields in a heatmap to determine if there are any correlations between features
columns_of_interest = ['V3', 'V15', 'V18', 'V36', 'V39','Target']
plt.figure(figsize=(12, 7))
sns.heatmap(
    data[columns_of_interest].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
In [317]:
# Let's create a pair plot for those features of interests using Target as hue
columns_of_interest = ['V3', 'V15', 'V18', 'V36', 'V39','Target']
df_selected = data[columns_of_interest]

# Create a pair plot
sns.pairplot(df_selected, hue="Target")
Out[317]:
<seaborn.axisgrid.PairGrid at 0x1c2fd71d0>

Observations¶

  • There is a moderate negative correlation between V15 and V18.
  • There is a moderate positive correlation between V36 and V39.

Data Pre-processing¶

In [318]:
# Separating target variable and other variables
X = data.drop(columns="Target")
X = pd.get_dummies(X)

Y = data["Target"]
In [319]:
#Check the size of the data
print(f"There are {X.shape[0]} rows and {X.shape[1]} features in the data frame.")
There are 20000 rows and 40 features in the data frame.
In [320]:
# Let's now split the Data set into training and validation data
X_train, X_val, y_train, y_val = train_test_split(
    X, Y, test_size=0.25, random_state=1, stratify=Y
)

#Check the size of the data
print(f"There are {X_train.shape[0]} rows and {X_train.shape[1]} features in the data frame.")

#Check the size of the data
print(f"There are {X_val.shape[0]} rows and {X_val.shape[1]} features in the data frame.")
There are 15000 rows and 40 features in the data frame.
There are 5000 rows and 40 features in the data frame.
In [321]:
# Let's prepare the test data set now
# Separating target variable and other variables
X_test = data_test.drop(columns="Target")
y_test = data_test["Target"]

# The test data is comprised of only sensor data that is numerical data. No need to create dummy features
In [322]:
#Check the size of the data
print(f"There are {X_test.shape[0]} rows and {X_test.shape[1]} features in the data frame.")
There are 5000 rows and 40 features in the data frame.

Missing value imputation¶

In [323]:
# creating an instace of the imputer to be used
imputer = SimpleImputer(strategy="median")
In [324]:
# Fit and transform the train data
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)

# Transform the validation data
# Using fit here will cause data leakage
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)

# Transform the test data
# Using fit here will cause darta leakage
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_train.columns)
In [325]:
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64

Observations¶

  • There are no missing values. All values were imputed with the median.

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [326]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf

Defining scorer to be used for cross-validation and hyperparameter tuning¶

  • We want to reduce false negatives and will try to maximize "Recall".
  • To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
In [327]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Model Building with original data¶

Let's start by building different models using KFold and cross_val_score and tune the best model using GridSearchCV and RandomizedSearchCV

  • Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. Split dataset into k consecutive folds (without shuffling by default) keeping the distribution of both classes in each fold the same as the target variable. Each fold is then used once as validation while the k - 1 remaining folds form the training set.
In [328]:
%%time 

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Dtree-Reg", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging-Reg", BaggingClassifier(random_state=1)))
models.append(("Adaboost-Reg", AdaBoostClassifier(random_state=1)))
models.append(("GBM-Reg", GradientBoostingClassifier(random_state=1)))
models.append(("RandomForest-Reg", RandomForestClassifier(random_state=1)))
models.append(("Xgboost-Reg", XGBClassifier(random_state=1, eval_metric="logloss")))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
    
Cross-Validation performance on training dataset:

Dtree-Reg: 0.6982829521679532
Bagging-Reg: 0.7210807301060529
Adaboost-Reg: 0.6309140754635308
GBM-Reg: 0.7066661857008874
RandomForest-Reg: 0.7235192266070268
Xgboost-Reg: 0.8100497799581561

Validation Performance:

Dtree-Reg: 0.7050359712230215
Bagging-Reg: 0.7302158273381295
Adaboost-Reg: 0.6762589928057554
GBM-Reg: 0.7230215827338129
RandomForest-Reg: 0.7266187050359713
Xgboost-Reg: 0.8309352517985612
CPU times: user 3min 16s, sys: 7.71 s, total: 3min 24s
Wall time: 3min 10s
In [329]:
print(f"Test Data Recall scores fitted on training data set")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_test, model.predict(X_test))
    print("{}: {}".format(name, scores))
Test Data Recall scores fitted on training data set
Dtree-Reg: 0.7127659574468085
Bagging-Reg: 0.6595744680851063
Adaboost-Reg: 0.6134751773049646
GBM-Reg: 0.6914893617021277
RandomForest-Reg: 0.7304964539007093
Xgboost-Reg: 0.8049645390070922

Observations¶

  • Currently, the Xgboost model is performing the best with a cross validation score of 0.81 and a recall score of 0.83.
In [330]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()
  • We can see that Xgboost is giving the highest cross-validated recall followed by Bagging and Random Forest

  • We will tune the Xgboost, Bagging, and Random Forest models and see if the performance improves

Model Building with Oversampled data¶

In [331]:
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 832
Before Oversampling, counts of label 'No': 14168 

After Oversampling, counts of label 'Yes': 14168
After Oversampling, counts of label 'No': 14168 

After Oversampling, the shape of train_X: (28336, 40)
After Oversampling, the shape of train_y: (28336,) 

In [332]:
%%time 

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Dtree-Over", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging-Over", BaggingClassifier(random_state=1)))
models.append(("Adaboost-Over", AdaBoostClassifier(random_state=1)))
models.append(("GBM-Over", GradientBoostingClassifier(random_state=1)))
models.append(("RandomForest-Over", RandomForestClassifier(random_state=1)))
models.append(("Xgboost-Over", XGBClassifier(random_state=1, eval_metric="logloss")))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Dtree-Over: 0.9720494245534969
Bagging-Over: 0.9762141471581656
Adaboost-Over: 0.8978689011775473
GBM-Over: 0.9256068151319724
RandomForest-Over: 0.9839075260047615
Xgboost-Over: 0.9891305241357218

Validation Performance:

Dtree-Over: 0.7769784172661871
Bagging-Over: 0.8345323741007195
Adaboost-Over: 0.8561151079136691
GBM-Over: 0.8776978417266187
RandomForest-Over: 0.8489208633093526
Xgboost-Over: 0.8669064748201439
CPU times: user 5min 27s, sys: 10.4 s, total: 5min 38s
Wall time: 5min 21s
In [333]:
print(f"Test Data Recall scores using oversampled data sets")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_test, model.predict(X_test))
    print("{}: {}".format(name, scores))
Test Data Recall scores using oversampled data sets
Dtree-Over: 0.7659574468085106
Bagging-Over: 0.7695035460992907
Adaboost-Over: 0.8191489361702128
GBM-Over: 0.8546099290780141
RandomForest-Over: 0.8333333333333334
Xgboost-Over: 0.8368794326241135
In [334]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

Observations using over sampled data¶

  • We can see that Xgboost is giving the highest cross-validated recall followed by Random Forest and Bagging

Model Building with Undersampled data¶

In [335]:
print("Before Undersampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)


print("After Undersampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Undersampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Undersampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Undersampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Undersampling, counts of label 'Yes': 832
Before Undersampling, counts of label 'No': 14168 

After Undersampling, counts of label 'Yes': 832
After Undersampling, counts of label 'No': 832 

After Undersampling, the shape of train_X: (1664, 40)
After Undersampling, the shape of train_y: (1664,) 

In [336]:
%%time 

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Dtree-Under", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging-Under", BaggingClassifier(random_state=1)))
models.append(("Adaboost-Under", AdaBoostClassifier(random_state=1)))
models.append(("GBM-Under", GradientBoostingClassifier(random_state=1)))
models.append(("RandomForest-Under", RandomForestClassifier(random_state=1)))
models.append(("Xgboost-Under", XGBClassifier(random_state=1, eval_metric="logloss")))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Dtree-Under: 0.8617776495202367
Bagging-Under: 0.8641945025611427
Adaboost-Under: 0.8666113556020489
GBM-Under: 0.8990621167303946
RandomForest-Under: 0.9038669648654498
Xgboost-Under: 0.9014717552846114

Validation Performance:

Dtree-Under: 0.841726618705036
Bagging-Under: 0.8705035971223022
Adaboost-Under: 0.8489208633093526
GBM-Under: 0.8884892086330936
RandomForest-Under: 0.8920863309352518
Xgboost-Under: 0.89568345323741
CPU times: user 20.4 s, sys: 5.37 s, total: 25.8 s
Wall time: 16.2 s
In [337]:
print(f"Test Data Recall scores for undersampled data sets")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_test, model.predict(X_test))
    print("{}: {}".format(name, scores))
Test Data Recall scores for undersampled data sets
Dtree-Under: 0.8085106382978723
Bagging-Under: 0.8581560283687943
Adaboost-Under: 0.8546099290780141
GBM-Under: 0.8687943262411347
RandomForest-Under: 0.875886524822695
Xgboost-Under: 0.875886524822695
In [338]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

Observations using over sampled data¶

  • We can see that Xgboost is giving the highest cross-validated recall followed by Random Forest and GBM.

HyperparameterTuning¶

Note¶

  1. Sample parameter grid has been provided to do necessary hyperparameter tuning. One can extend/reduce the parameter grid based on execution time and system configuration to try to improve the model performance further wherever needed.
  2. The models chosen for Hyperparameter Tuning are as follows:
    • Xgboost-Reg (Using regular data sets)
      • Had a decent cross-validation score (.810), recall score against val data (.831), and recall score against test data (.805)
      • Had a maximum recall score (using validation data) of 0.86+ (has strong potential for tuning)
      • Adding to list to hypertune to see what performance gains can be achieved through tuning.
    • GBM-Under (Using undersampled data sets)
      • Had a decent cross-validation score (.899), recall score against val data (.888), and recall score against test data (.869)
      • The above three scores are all relatively close to each other indicating minimal overfitting and/or underfitting.
    • RandomForest-Under (Using undersampled data sets)
      • Had a decent cross-validation score (.904), recall score against val data (.892), and recall score against test data (.876)
      • The above three scores are all relatively close to each other indicating minimal overfitting and/or underfitting.
    • Xgboost-Under (Using undersampled data sets)
      • Had a decent cross-validation score (.901), recall score against val data (.896), and recall score against test data (.876)
        • The above three scores are all relatively close to each other indicating minimal overfitting and/or underfitting.
    • Models using oversampled data were not selected because of the following reasons:
      • Overall these models' recall scores against test data were lower than the undersampled recall scores
      • These models also scored very high using the training data (.92+) and there is large concern these models are too overfitted for production.

Sample Parameter Grids¶

Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.

  • For Gradient Boosting:

param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }

  • For Random Forest:

param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }

  • For XGBoost:

param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }

Tuning XGBoost using regular data¶

In [339]:
%%time

# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')

#Parameter grid to pass in RandomSearchCV
param_grid={
    'n_estimators':[150,200,250],
    'scale_pos_weight':[5,10], 
    'learning_rate':[0.1,0.2], 
    'gamma':[0,3,5], 
    'subsample':[0.8,0.9]}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, 
    param_distributions=param_grid, 
    n_iter=50, 
    n_jobs = -1, 
    scoring=scorer, 
    cv=5, 
    random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.2, 'gamma': 5} with CV score=0.8582136930957363:
CPU times: user 2.9 s, sys: 1.68 s, total: 4.58 s
Wall time: 1min 3s
In [340]:
# Create an XGB tuned classifier for the regular/original data set.
xgb2_reg_tuned = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=0.8,
    scale_pos_weight=10,
    n_estimators=200,
    learning_rate=0.2,
    gamma=5,
)

# Fit using the training data set.
xgb2_reg_tuned.fit(X_train, y_train)
Out[340]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=5, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.2, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=5, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.2, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In [341]:
# Check the model performance using the training data set.
xgb2_reg_tuned_train_perf = model_performance_classification_sklearn(
    xgb2_reg_tuned, X_train, y_train
)
xgb2_under_tuned_train_perf
Out[341]:
Accuracy Recall Precision F1
0 0.979 1.000 0.960 0.979
In [342]:
# Check the model performance using the validation data set.
xgb2_reg_tuned_val_perf =  model_performance_classification_sklearn(
    xgb2_reg_tuned, X_val, y_val
)
xgb2_reg_tuned_val_perf
Out[342]:
Accuracy Recall Precision F1
0 0.987 0.849 0.911 0.879
In [343]:
# Check the model performance using the test data set.
xgb2_reg_tuned_test_perf =  model_performance_classification_sklearn(
    xgb2_reg_tuned, X_test, y_test
)
xgb2_reg_tuned_test_perf
Out[343]:
Accuracy Recall Precision F1
0 0.987 0.837 0.922 0.877

Tuning XGBoost using undersampled data¶

In [344]:
%%time

# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')

#Parameter grid to pass in RandomSearchCV
param_grid={
    'n_estimators':[150,200,250],
    'scale_pos_weight':[5,10], 
    'learning_rate':[0.1,0.2], 
    'gamma':[0,3,5], 
    'subsample':[0.8,0.9]}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, 
    param_distributions=param_grid, 
    n_iter=50, 
    n_jobs = -1, 
    scoring=scorer, 
    cv=5, 
    random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9290599523843879:
CPU times: user 1.71 s, sys: 1.11 s, total: 2.83 s
Wall time: 22.8 s
In [345]:
# Create an XGB tuned classifier for the underfitted data set.
xgb2_under_tuned = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=0.9,
    scale_pos_weight=10,
    n_estimators=200,
    learning_rate=0.1,
    gamma=5,
)

xgb2_under_tuned.fit(X_train_un, y_train_un)
Out[345]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=5, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=5, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In [346]:
# Check the model performance using the training underfitted data set.
xgb2_under_tuned_train_perf = model_performance_classification_sklearn(
    xgb2_under_tuned, X_train_un, y_train_un
)
xgb2_under_tuned_train_perf
Out[346]:
Accuracy Recall Precision F1
0 0.979 1.000 0.960 0.979
In [347]:
# Check the model performance using the validation underfitted data set.
xgb2_under_tuned_val_perf =  model_performance_classification_sklearn(
    xgb2_under_tuned, X_val, y_val
)
xgb2_under_tuned_val_perf
Out[347]:
Accuracy Recall Precision F1
0 0.832 0.921 0.239 0.379
In [348]:
# Check the model performance using the test underfitted data set.
xgb2_under_tuned_test_perf =  model_performance_classification_sklearn(
    xgb2_under_tuned, X_test, y_test
)
xgb2_under_tuned_test_perf
Out[348]:
Accuracy Recall Precision F1
0 0.834 0.890 0.239 0.376

Tuning GBM using undersampled data¶

In [349]:
%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(100,150,25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample":[0.5,0.7],
    "max_features":[0.5,0.7]
}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, 
    param_distributions=param_grid, 
    n_iter=50, 
    n_jobs = -1, 
    scoring=scorer, 
    cv=5, 
    random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.5, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2} with CV score=0.9038236779453142:
CPU times: user 993 ms, sys: 235 ms, total: 1.23 s
Wall time: 19.2 s
In [350]:
# Create an Gradient Boosting tuned classifier for the underfitted data set.
gbm_under_tuned = GradientBoostingClassifier(
    random_state=1,
    subsample=0.5,
    n_estimators=125,
    max_features=0.7,
    learning_rate=0.2,
)

gbm_under_tuned.fit(X_train_un, y_train_un)
Out[350]:
GradientBoostingClassifier(learning_rate=0.2, max_features=0.7,
                           n_estimators=125, random_state=1, subsample=0.5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.2, max_features=0.7,
                           n_estimators=125, random_state=1, subsample=0.5)
In [351]:
# Check the model performance using the training underfitted data set.
gbm_under_tuned_train_perf = model_performance_classification_sklearn(
    xgb2_under_tuned, X_train_un, y_train_un
)
gbm_under_tuned_train_perf
Out[351]:
Accuracy Recall Precision F1
0 0.979 1.000 0.960 0.979
In [352]:
# Check the model performance using the validation underfitted data set.
gbm_under_tuned_val_perf =  model_performance_classification_sklearn(
    xgb2_under_tuned, X_val, y_val
)
gbm_under_tuned_val_perf
Out[352]:
Accuracy Recall Precision F1
0 0.832 0.921 0.239 0.379
In [353]:
# Check the model performance using the test underfitted data set.
gbm_under_tuned_test_perf =  model_performance_classification_sklearn(
    xgb2_under_tuned, X_test, y_test
)
gbm_under_tuned_test_perf
Out[353]:
Accuracy Recall Precision F1
0 0.834 0.890 0.239 0.376

Tuning Rain Forest using undersampled data¶

In [354]:
%%time

# defining model
Model = RandomForestClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, 
    param_distributions=param_grid, 
    n_iter=50, 
    n_jobs = -1, 
    scoring=scorer, 
    cv=5, 
    random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.8990116153235697:
CPU times: user 1.27 s, sys: 186 ms, total: 1.46 s
Wall time: 25.9 s
In [355]:
# Create an Random Forest tuned classifier for the underfitted data set.
randomforest_under_tuned = RandomForestClassifier(
    random_state=1,
    n_estimators=300,
    min_samples_leaf=2,
    max_samples=.5,
    max_features='sqrt',
)

randomforest_under_tuned.fit(X_train_un, y_train_un)
Out[355]:
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=300,
                       random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=300,
                       random_state=1)
In [356]:
# Check the model performance using the training underfitted data set.
randomforest_under_tuned_train_perf = model_performance_classification_sklearn(
    randomforest_under_tuned, X_train_un, y_train_un
)
randomforest_under_tuned_train_perf
Out[356]:
Accuracy Recall Precision F1
0 0.961 0.933 0.989 0.960
In [357]:
# Check the model performance using the validation underfitted data set.
randomforest_under_tuned_val_perf =  model_performance_classification_sklearn(
    randomforest_under_tuned, X_val, y_val
)
randomforest_under_tuned_val_perf
Out[357]:
Accuracy Recall Precision F1
0 0.938 0.885 0.468 0.612
In [358]:
# Check the model performance using the test underfitted data set.
randomforest_under_tuned_test_perf =  model_performance_classification_sklearn(
    randomforest_under_tuned, X_test, y_test
)
randomforest_under_tuned_test_perf
Out[358]:
Accuracy Recall Precision F1
0 0.944 0.879 0.500 0.638

Model performance comparison and choosing the final model¶

Training set final performance¶

In [359]:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        xgb2_reg_tuned_train_perf.T,
        xgb2_under_tuned_train_perf.T,
        gbm_under_tuned_train_perf.T,
        randomforest_under_tuned_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Xgboost tuned with regular data",
    "Xgboost tuned with undersampled data",
    "Gradient Boost tuned with undersampled data",
    "Random Forest tuned with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[359]:
Xgboost tuned with regular data Xgboost tuned with undersampled data Gradient Boost tuned with undersampled data Random Forest tuned with undersampled data
Accuracy 0.999 0.979 0.979 0.961
Recall 1.000 1.000 1.000 0.933
Precision 0.974 0.960 0.960 0.989
F1 0.987 0.979 0.979 0.960

Validation set final performance¶

In [360]:
# training performance comparison
models_val_comp_df = pd.concat(
    [
        xgb2_reg_tuned_val_perf.T,
        xgb2_under_tuned_val_perf.T,
        gbm_under_tuned_val_perf.T,
        randomforest_under_tuned_val_perf.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Xgboost tuned with regular data",
    "Xgboost tuned with undersampled data",
    "Gradient Boost tuned with undersampled data",
    "Random Forest tuned with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
Out[360]:
Xgboost tuned with regular data Xgboost tuned with undersampled data Gradient Boost tuned with undersampled data Random Forest tuned with undersampled data
Accuracy 0.987 0.832 0.832 0.938
Recall 0.849 0.921 0.921 0.885
Precision 0.911 0.239 0.239 0.468
F1 0.879 0.379 0.379 0.612

Test set final performance¶

Now we have our final model, so let's find out how our final model is performing on unseen test data.

In [361]:
# training performance comparison
models_test_comp_df = pd.concat(
    [
        xgb2_reg_tuned_test_perf.T,
        xgb2_under_tuned_test_perf.T,
        gbm_under_tuned_test_perf.T,
        randomforest_under_tuned_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Xgboost tuned with regular data",
    "Xgboost tuned with undersampled data",
    "Gradient Boost tuned with undersampled data",
    "Random Forest tuned with undersampled data",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[361]:
Xgboost tuned with regular data Xgboost tuned with undersampled data Gradient Boost tuned with undersampled data Random Forest tuned with undersampled data
Accuracy 0.987 0.834 0.834 0.944
Recall 0.837 0.890 0.890 0.879
Precision 0.922 0.239 0.239 0.500
F1 0.877 0.376 0.376 0.638

Observations¶

  • The selected model is the Random Forest tuned with undersampled data

    • This model had a training recall score of: .933
    • This model had a validation recall score of: .885
    • This model had a test recall score: .879
  • The Random Forest had very good recall scores across the various sets of data suggesting it is minimizing overfitting and underfitting scenarios compared to the other models that may be scoring higher in some areas.

  • Concerned with the other higher performing test set recall scores because the higher performing models have training set recall score of 1.0.

Feature Importances¶

In [362]:
feature_names = X_train.columns
importances =  randomforest_under_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Pipelines to build the final model¶

  • Now that we have a final model, let's use pipelines to put the model into production. We know that we can use pipelines to standardize the model building, but the steps in a pipeline are applied to each and every variable.
In [363]:
# Let's prepare a new data set that doesn't have imputed data, etc.
# Separating target variable and other variables
X1 = data.drop(columns="Target")
Y1 = data["Target"]

# Since we already have a separate test set, we don't need to divide data into train and test

X_test1 = df_test.drop(columns='Target') 
y_test1 = df_test['Target']
In [364]:
#Check the size of the data
print(f"There are {X1.shape[0]} rows and {X1.shape[1]} features in the data frame.")
There are 20000 rows and 40 features in the data frame.
In [365]:
#Check the size of the data
print(f"There are {Y1.shape[0]} features in the data frame.")
There are 20000 features in the data frame.
In [366]:
#Check the size of the data
print(f"There are {X_test1.shape[0]} rows and {X_test1.shape[1]} features in the data frame.")
There are 5000 rows and 40 features in the data frame.
In [367]:
#Check the size of the data
print(f"There are {y_test1.shape[0]} rows features in the data frame.")
There are 5000 rows features in the data frame.
In [368]:
# We need to impute the X1 (basically the training and validation together) data set to fix missing values.
imputer = SimpleImputer(strategy="median")
X1 = imputer.fit_transform(X1)
In [369]:
# Since the Random Forest using undersampling was selected as the final model, we need to also do 
# underfitting with X1 and Y1
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_under1, y_under1 = rus.fit_resample(X1, Y1)
In [370]:
# Create the pipeline model that imputes the data and then creates the final Random Forest model.
Pipeline_model = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                                 ("RandomForest_under", RandomForestClassifier(
                                    random_state=1,
                                    n_estimators=300,
                                    min_samples_leaf=2,
                                    max_samples=.5,
                                    max_features='sqrt',)), 
                                ])
In [371]:
# Fit the model on the undersampled training data
# X_under1 and Y_under1 will run through the pipeline.  These data sets will be imputed just in the pipeline.
# However, the original X_under1 will not be updated due to imputed step.
Pipeline_model.fit(X_under1, y_under1)
Out[371]:
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('RandomForest_under',
                 RandomForestClassifier(max_samples=0.5, min_samples_leaf=2,
                                        n_estimators=300, random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('RandomForest_under',
                 RandomForestClassifier(max_samples=0.5, min_samples_leaf=2,
                                        n_estimators=300, random_state=1))])
SimpleImputer(strategy='median')
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=300,
                       random_state=1)
In [372]:
# Let's check the performance on test set
Model_test = model_performance_classification_sklearn(Pipeline_model, X_test1, y_test1)
Model_test
Out[372]:
Accuracy Recall Precision F1
0 0.945 0.872 0.507 0.641

Business Insights and Conclusions¶

  • The final model selected is the Random Forest Model using underfitted data sets.
  • The Random Forest - Under model has a recall score of 0.872.
  • The Random Forest - Under model can be used to help predict component failures.

  • Recommend ReneWind focus on these important features that can help determine failures:

    • V3: Lower values have higher probability of causing a failure.
    • V18: Lower values have higher probability of causing a failure.
    • V36: Lower values have higher probability of causing a failure.
    • V39: Lower values have higher probability of causing a failure.
    • V15: Higher values have higher probability of causing a failure.
  • Recommend ReneWind consider future feature engineering efforts of combining features with correlations such as: V15 and V18.

  • Recommend ReneWind consider future feature engineering efforts to combining features with correlations such as: V36 and V39.
  • Recommend ReneWind consider incorporating time feature data to help determine if time-of-day is a factor in component failures.